There are two ways (at least):
bootstrap package.
Recall the soap data:
Pearson's product-moment correlation
data: speed and scrap
t = 15.829, df = 10, p-value = 2.083e-08
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.9302445 0.9947166
sample estimates:
cor
0.9806224
speed-scrap pairs together
*** REDO THIS ***
bcanon, write a function that takes a vector of row numbers and returns the correlation between speed and scrap for those rows:
[1] 0.9928971
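A minimal sketch of such a function follows. The soap data are not reproduced here, so `soap_demo` and its values are invented purely for illustration, and the name `row_cor` is hypothetical. The row numbers come first and the data frame second, so that the row numbers are the thing being resampled:

```r
# Hypothetical stand-in for the soap data (values invented for illustration)
soap_demo <- data.frame(
  speed = c(100, 125, 150, 175, 200, 225, 250, 275),
  scrap = c(2.1, 2.6, 3.4, 3.7, 4.6, 5.0, 5.9, 6.1)
)

# Correlation between speed and scrap using only the given rows;
# rows may contain repeats, as in a bootstrap resample
row_cor <- function(rows, d) {
  cor(d$speed[rows], d$scrap[rows])
}

# Using every row once reproduces the ordinary sample correlation
all_rows <- seq_len(nrow(soap_demo))
row_cor(all_rows, soap_demo)

# A resample of row numbers keeps each speed-scrap pair together
set.seed(1)
row_cor(sample(all_rows, replace = TRUE), soap_demo)
```

Resampling row numbers rather than the two columns separately is what keeps each speed value attached to its own scrap value.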
bcanon are now:
cor.test interval in capturing the skewness of the distribution.
Consider this example: samples of UK and Ontario (Canada) children, and their journey times to school, in minutes:
We want to compare the mean journey times in the two different places. This is a two-sample situation, and if we are not careful with the bootstrap, things will go wrong:
Our original samples were 40 from each location, but by randomly resampling rows, we probably don’t get 40 from each. We need to draw “stratified resamples” to ensure that we get 40 from each place. This is hard to organize with the build-it-yourself bootstrap. To make things easier, we use the rsample package, but then we have to worry about handling the results.
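To see the problem concretely, here is a sketch in base R (chosen so that it runs without extra packages; it is not the rsample approach itself). The data frame `journey` is simulated and hypothetical, with 40 children per place to match the description above:

```r
# Hypothetical journey-time data: 40 children from each place (values simulated)
set.seed(457)
journey <- data.frame(
  location = rep(c("Ontario", "UK"), each = 40),
  minutes  = round(c(rexp(40, 1 / 18), rexp(40, 1 / 15)))
)

# Naive resample: draw 80 rows from the whole data frame
naive_rows <- sample(nrow(journey), replace = TRUE)
table(journey$location[naive_rows])   # counts are usually not 40 and 40

# Stratified resample: draw 40 rows with replacement within each location
rows_by_location <- split(seq_len(nrow(journey)), journey$location)
strat_rows <- unlist(lapply(
  rows_by_location,
  function(r) r[sample.int(length(r), replace = TRUE)]
))
table(journey$location[strat_rows])   # always 40 and 40
```

The stratified version resamples within each location separately, so each resample has the same group sizes as the original data.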
Let’s go back to our IRS data for a moment:
What happens if we use rsample to resample from these? Let’s just do a few to start. rsample has a function bootstraps that does this:
Each of those things in splits is one bootstrap resample. To get at the things in them, we use analysis:
and then unnest the actual samples to see them:
The values in Time are the resampled-with-replacement times to fill in the form.
What we cared about here was the bootstrap distribution of the sample mean, so that for each of the samples in dd we need to find the mean Time in it:
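One way this might look, as a sketch: the IRS data are not reproduced here, so `irs_demo` is a simulated stand-in (30 values in a column called Time), and the names `boot_means` and `mean_time` are hypothetical. `bootstraps` and `analysis` are from rsample, and `map_dbl` is from purrr:

```r
library(dplyr)
library(purrr)
library(rsample)

# Hypothetical stand-in for the IRS data: 30 form-completion times (simulated)
set.seed(457)
irs_demo <- tibble(Time = round(rgamma(30, shape = 3, scale = 67)))

boot_means <- irs_demo %>%
  bootstraps(times = 1000) %>%   # one bootstrap resample per row, in splits
  mutate(mean_time = map_dbl(splits, ~ mean(analysis(.x)$Time)))

# Centre and spread of the bootstrap distribution of the sample mean
boot_means %>% summarize(centre = mean(mean_time), spread = sd(mean_time))
```

Working out the mean inside `map_dbl` avoids unnesting all the resampled values when only one number per resample is needed.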
and for example make a histogram of them, to see how normal this is:
This actually looks pretty normal:
all of which suggests that the \(t\)-interval for the mean:
One Sample t-test
data: Time
t = 8.9035, df = 29, p-value = 8.589e-10
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
155.0081 247.4585
sample estimates:
mean of x
201.2333
and some kind of bootstrap interval for the mean, say the percentile-based one:
won’t be all that far apart.
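As a rough illustration of that comparison, here is a sketch in base R. The vector `times_demo` is simulated and hypothetical, not the IRS data, so the numbers will not match the output above:

```r
# Hypothetical stand-in for the 30 times (simulated, right-skewed)
set.seed(457)
times_demo <- round(rgamma(30, shape = 3, scale = 67))

# Bootstrap distribution of the sample mean
boot_means <- replicate(10000, mean(sample(times_demo, replace = TRUE)))

# Percentile interval: the middle 95% of the bootstrap means
percentile_ci <- quantile(boot_means, c(0.025, 0.975))

# t-interval from the same data, for comparison
t_ci <- t.test(times_demo)$conf.int

percentile_ci
t_ci
```

When the bootstrap distribution of the mean is close to normal, the two intervals should have similar endpoints; a noticeable difference would point to skewness that the t-interval ignores.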
How do I get the BCa interval from this output? First write a function that gets the mean Time from given rows of a data frame:
[1] 132.6667
check. And then feed into bcanon these things:
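A sketch of both steps, under some assumptions: `irs_demo` is a simulated stand-in for the IRS data, `mean_rows` is a hypothetical name, and checking on the first three rows is just one convenient choice. The `bcanon` call follows the pattern in the bootstrap package's documentation for a statistic that needs a data frame: resample the row numbers, and pass the data frame along as an extra argument. The call is wrapped in a check so that the rest runs even if the package is not installed:

```r
# Hypothetical stand-in for the IRS data (simulated)
set.seed(457)
irs_demo <- data.frame(Time = round(rgamma(30, shape = 3, scale = 67)))

# Mean Time for the given rows of a data frame
mean_rows <- function(rows, d) {
  mean(d$Time[rows])
}

# Check on a few rows
mean_rows(1:3, irs_demo)

# BCa interval: x is the row numbers, theta the function,
# and the data frame is passed through to theta as an extra argument
if (requireNamespace("bootstrap", quietly = TRUE)) {
  bca <- bootstrap::bcanon(
    x = seq_len(nrow(irs_demo)),
    nboot = 1000,
    theta = mean_rows,
    irs_demo,
    alpha = c(0.025, 0.975)
  )
  bca$confpoints   # interval endpoints, one row per value of alpha
}
```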
can I do “stratified resampling”? Yes
that seems to work, but I want to try it on groups of different sizes
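resample_by_group <- function(d, var, gp) {
  # resample rows with replacement within each group, keeping group sizes fixed
  # (var is not used in this step)
  d %>% group_by({{ gp }}) %>%
    sample_frac(replace = TRUE)
}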
groups %>% resample_by_group(var=y, gp=group)
second step: difference in means between (evidently two) groups
mean_diff <- function(d, var, gp) {
  # mean of var in each group, then first group mean minus second
  d %>% group_by({{ gp }}) %>%
    summarize(m = mean({{ var }})) %>% pull(m) -> v
  v[1] - v[2]
}
mean_diff(groups, y, group)
[1] -2.5
bootstrap it
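A sketch of how the two steps might be combined. The data frame `groups_demo` is invented (with unequal group sizes on purpose), the two functions are restated so that the block runs on its own, and `var` is dropped from the resampling function since it was not used there:

```r
library(dplyr)
library(purrr)

# Hypothetical two-group data with unequal group sizes (values invented)
groups_demo <- tibble(
  group = c(rep("a", 4), rep("b", 6)),
  y     = c(10, 12, 13, 15, 14, 15, 17, 18, 19, 21)
)

# Resample rows with replacement within each group, keeping group sizes fixed
resample_by_group <- function(d, gp) {
  d %>%
    group_by({{ gp }}) %>%
    sample_frac(size = 1, replace = TRUE) %>%
    ungroup()
}

# Difference in means, first group minus second
mean_diff <- function(d, var, gp) {
  v <- d %>%
    group_by({{ gp }}) %>%
    summarize(m = mean({{ var }})) %>%
    pull(m)
  v[1] - v[2]
}

# Bootstrap distribution of the difference in means
set.seed(457)
boot_diffs <- map_dbl(
  1:1000,
  ~ groups_demo %>% resample_by_group(group) %>% mean_diff(y, group)
)

quantile(boot_diffs, c(0.025, 0.975))   # percentile interval for the difference
```

With groups this small, the bootstrap distribution takes only a limited set of values, so expect it to look discrete.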
vis
or even
this distribution is discrete but more or less normal:
now do it on travel times: this is wrong, because I have to sample the rows properly
normal quantile plot
ok, but that still doesn’t solve bcanon. (I think, don’t solve that.)
Comments
This is not so bad: a long right tail, maybe:
or not so much.